library(tidyverse)
library(plotly)
library(ggpubr)
library(rstatix)
2.1. Read in the gapminder_clean.csv data as a tibble using read_csv.
gapminder_clean <- read_csv("gapminder_clean.csv")
2.2. Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data.
gapminder_1962 <- filter(gapminder_clean, Year==1962)
ggplot(gapminder_1962, aes(x=`CO2 emissions (metric tons per capita)`, y=gdpPercap)) + geom_point() + ggtitle ("Scatter plot")
2.3. On the filtered data, calculate the correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the correlation and associated p value?
cor.test(gapminder_1962$`CO2 emissions (metric tons per capita)`, gapminder_1962$gdpPercap)
##
## Pearson's product-moment correlation
##
## data: gapminder_1962$`CO2 emissions (metric tons per capita)` and gapminder_1962$gdpPercap
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
There is a positive correlation of 0.926 (p-val <2.2e-16).
2.4. On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…
max_year_cor <- gapminder_clean %>% group_by(Year)%>%
summarize(year_cor= cor(`CO2 emissions (metric tons per capita)`, gdpPercap, use="complete.obs")) %>% top_n(n=1, wt=year_cor) %>% print()
## # A tibble: 1 × 2
## Year year_cor
## <dbl> <dbl>
## 1 1967 0.939
gapminder_1967 <- filter(gapminder_clean, Year==1967)
The correlation is strongest in 1967 (0.9388).
2.5. Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.
plotly_plot <- ggplot(gapminder_1967, aes(x=`CO2 emissions (metric tons per capita)`, y=gdpPercap, size=pop, color=continent)) + geom_point() + ggtitle("Plotly scatter plot")
ggplotly(plotly_plot)
2.6. What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed).
gapminder_energy <- select(gapminder_clean, continent, `Energy use (kg of oil equivalent per capita)`) %>% filter(complete.cases(.))
norm_test_energy <- gapminder_energy %>% group_by(continent) %>% summarize(shap_test_p_val= shapiro.test(`Energy use (kg of oil equivalent per capita)`)$p.value)
res.kruskal <-kruskal_test(`Energy use (kg of oil equivalent per capita)` ~ continent, data = gapminder_energy)
pwc <- gapminder_energy %>%
wilcox_test(`Energy use (kg of oil equivalent per capita)` ~ continent, p.adjust.method = "bonferroni")
pwc <- pwc %>% add_xy_position(x = "continent")
ggboxplot(gapminder_energy, x = "continent", y = "Energy use (kg of oil equivalent per capita)") +
stat_pvalue_manual(pwc, hide.ns = TRUE) +
labs(
subtitle = get_test_label(res.kruskal, detailed = TRUE),
caption = get_pwc_label(pwc)
)+ ggtitle ("Energy use by continent")
The normality test shows that the data are not normally distributed. Therefore, to compare all groups we need a non-parametric test (like Kruskal-Wallis H Test). We obtained a p-val < 0.0001, which indicates that the energy use in at least one of the continents differs from the others.
We then performed the correspondent pair-wise comparisons to check which continents differ from each other and we could see that all comparisons have significant p-values, indicating that the energy use in each continent differs significantly from the energy use in the other continents, with Oceania having the highest energy use and Africa the lowest.
2.7. Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed).
gapminder_ae_1990 <- filter(gapminder_clean, Year >1990 & (continent=="Europe" | continent =="Asia"))
norm_test_1990 <- gapminder_ae_1990 %>% group_by(continent) %>% summarize(shap_test_p_val= shapiro.test(`Imports of goods and services (% of GDP)`)$p.value) %>% print()
## # A tibble: 2 × 2
## continent shap_test_p_val
## <chr> <dbl>
## 1 Asia 0.0000000231
## 2 Europe 0.0000136
The normality test shows that the data are not normally distributed. Therefore, to compare both groups we need a non-parametric test (like the Mann–Whitney or Wilcoxon rank-sum test).
ggplot(gapminder_ae_1990, aes(x=continent, y=`Imports of goods and services (% of GDP)`, fill=continent)) + geom_violin() + stat_compare_means(method = "wilcox.test", label.x = 1.4, label.y = 200)+
theme(legend.position="none") + ggtitle ("Imports in Europe vs Asia")
No, there is not a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990.
2.8. What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?).
count_high_pop_dens <- gapminder_clean %>% group_by(`Country Name`) %>% summarize(av_pop_dens = mean(`Population density (people per sq. km of land area)`, na.rm = T)) %>% top_n(n=1, wt=av_pop_dens) %>% print()
## # A tibble: 1 × 2
## `Country Name` av_pop_dens
## <chr> <dbl>
## 1 Macao SAR, China 14732.
The country with the highest ‘Population density (people per sq. km of land area)’ across all years in Macao SAR, China.
2.9. What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?
count_incr_life_expect <- gapminder_clean %>% select(`Country Name`, `Life expectancy at birth, total (years)`, "Year") %>% filter (Year==1962 | Year== max(Year)) %>% group_by(`Country Name`)%>% summarize(inc_life_expect = `Life expectancy at birth, total (years)`[Year==max(Year)]-`Life expectancy at birth, total (years)`[Year==min(Year)]) %>% top_n(n=1, wt=inc_life_expect) %>% print()
## # A tibble: 1 × 2
## `Country Name` inc_life_expect
## <chr> <dbl>
## 1 Maldives 36.9
Maldives is the country that shows a greatest increase in life expectancy at birth since 1962, of around 37 years.